feat: add replay support to runner group and fix replay duration overrun #235

Open
JasonXuDeveloper wants to merge 3 commits into Azure:unstable-replay from JasonXuDeveloper:replay/pr6

JasonXuDeveloper commented Feb 15, 2026

Summary

  • Runner group handler: Refactor buildBatchJobObject to support replay mode — skip configmap upload for replay, mount PVC for local profiles, set REPLAY_PROFILE_SOURCE env var, use /run_replay.sh entrypoint
  • run_replay.sh: Entrypoint script for replay runner pods that invokes kperf runner replay and uploads results (with proper quoting and --data-binary)
  • Dockerfile: Copy run_replay.sh and chmod +x scripts
  • Fix duration overrun: Enforce a hard context deadline at profile.Duration() + 30s in both Schedule and ScheduleSingleRunner so replays can't run more than 30s past the profile duration
  • Context cancellation as success: Treat context.Canceled/DeadlineExceeded as success for all verbs — when the deadline fires, in-flight requests are cancelled intentionally (not failures), and this avoids the ObserveFailure() mutex thundering-herd at shutdown
  • Remove unnecessary atomics: Change per-worker metrics from int32/atomic to plain int (each goroutine owns its instance)
  • Simplify WATCH goroutines: Fire-and-forget execution, remove redundant context check and concurrent metric writes
  • Worker formula: Update recommendedWorkers from conns * 3 to QPS-based calculation
  • Runner hardening: Guard against empty restClis in startWorkers
  • Scheduler hardening: Validate runnerIndex bounds in ScheduleSingleRunner

Test plan

  • go build ./... passes
  • go vet ./... passes
  • go test ./replay/... passes
  • End-to-end: run a 15-minute replay profile and confirm it completes in ~15 minutes

Part 6 of 6 in the replay feature PR stack. Depends on PR #234.

🤖 Generated with Claude Code

Copilot AI left a comment

Pull request overview

Adds “replay mode” execution to kperf, spanning runner-group deployment changes (indexed Jobs + replay entrypoint), a new replay engine/package (profile loading, partitioning, scheduling, runner), and CLI wiring for both local replay runs and distributed runner pods.

Changes:

  • Add replay profile types + loader, scheduler, runner, partitioning, and request builder under replay/.
  • Extend runnergroup deployment to support replay mode (skip configmap upload, mount PVC optionally, run /run_replay.sh in indexed Jobs).
  • Add CLI commands for replay (kperf replay run for local mode, kperf runner replay for runner pods) and refactor latency percentile reporting.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| testdata/sample_replay_runnergroup.yaml | Example RunnerGroupSpec for replay mode (URL/PVC profile sources). |
| testdata/sample_replay.yaml | Sample replay profile YAML with realistic request sequence. |
| scripts/run_replay.sh | New runner-pod entrypoint for replay mode + result upload loop. |
| runner/group/handler.go | Replay-aware job building (script, env, PVC mount) + skip configmap upload in replay mode. |
| replay/schedule_test.go | Tests for result aggregation and config warning validation behavior. |
| replay/schedule.go | Local multi-runner scheduler + single-runner entry for distributed mode + aggregation/warnings. |
| replay/runner_test.go | Unit tests/benchmarks for runner internals (bucket sizing, indexing). |
| replay/runner.go | Replay runner implementation (worker pool, WATCH handling, metrics). |
| replay/partition_test.go | Tests for deterministic partitioning and per-object ordering. |
| replay/partition.go | Partitioning logic (FNV-1a by object key) + distribution analysis helpers. |
| replay/loader_test.go | Tests for loading profiles from file/gzip and validation errors. |
| replay/loader.go | Profile loader supporting local paths and HTTP(S) + gzip auto-detection. |
| replay/builder_test.go | Tests for request building, masking, method mapping, query handling. |
| replay/builder.go | Request builder/executor using rest.Interface + masking for metric aggregation. |
| metrics/utils.go | New helper to build percentile latency reports (aggregate + per-URL) with optional raw data. |
| cmd/kperf/commands/runnergroup/run.go | Avoid clobbering nodeAffinity unless CLI flags are provided. |
| cmd/kperf/commands/runner/runner.go | Add runner replay subcommand + reuse percentile-report helper + config validation. |
| cmd/kperf/commands/root.go | Register top-level replay command. |
| cmd/kperf/commands/replay/run_test.go | Tests for local replay report building (with/without raw data). |
| cmd/kperf/commands/replay/run.go | Implement kperf replay run local-mode command and JSON reporting. |
| cmd/kperf/commands/replay/root.go | Add replay CLI root with run subcommand. |
| api/types/runner_group.go | Add replay fields to RunnerGroupSpec + IsReplayMode(). |
| api/types/replay_test.go | Tests for replay types validation and helpers. |
| api/types/replay.go | Define replay profile/request/spec types + validation + duration helper. |
| api/types/load_traffic.go | Fix typo in comment (“target”). |
| Dockerfile | Include /run_replay.sh and ensure scripts are executable. |

JasonXuDeveloper force-pushed the replay/pr6 branch 18 times, most recently from dd4ffe3 to d03afca, on February 18, 2026 22:54
JasonXuDeveloper changed the title from "feat: add replay support to runner group and deployment infrastructure" to "feat: add replay mode to runner group deployment and harden runner/scheduler" on Feb 18, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

scripts/run_replay.sh:33

  • The log message "Uploaded it" is not very actionable when debugging uploads (it doesn’t include which file/runner or where it was uploaded). Consider logging the runner identity and/or target URL (and possibly the HTTP status) to make successful uploads traceable in pod logs.
      echo "Uploaded it"
      exit 0
      ;;

scripts/run_replay.sh:40

  • The message "Leaking pod? skip" is ambiguous and colloquial, which makes it hard to understand the actual failure mode (a 404 from the result upload endpoint). Consider rewording it to state explicitly that the runner is not recognized by the server (404) and that the pod is exiting as a result.
    404)
      echo "Leaking pod? skip"
      exit 1;

Distributed replay mode integration:
- Replay-aware job building: skip configmap upload for replay mode,
  use indexed Jobs for runner assignment, custom replay entrypoint script
- run_replay.sh: entrypoint script for replay runner pods that downloads
  the replay profile and invokes kperf runner replay
- Dockerfile: chmod +x for scripts directory

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

…ring herd

Treat context cancellation as success for all verbs (not just WATCH) to
prevent mutex contention at shutdown. Remove unnecessary atomic operations
on per-worker metrics and simplify WATCH goroutines to fire-and-forget.
Update worker recommendation formula to be QPS-based.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JasonXuDeveloper JasonXuDeveloper changed the title feat: add replay mode to runner group deployment and harden runner/scheduler feat: add replay support to runner group and fix replay duration overrun Feb 19, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

…Runner

Both Schedule and ScheduleSingleRunner were called with context.Background()
and never enforced a hard deadline. The replay would run until every request
completed naturally (up to 60s timeout each), causing 15-min profiles to run
20+ minutes.

Now both functions create a context.WithTimeout based on profile.Duration()
plus a 30s grace period. When the deadline fires, in-flight requests fail
with context.DeadlineExceeded (treated as success, like context.Canceled,
per the previous commit), and WATCH connections are torn down immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>